Efficient ETL

Effie Lo is a sales executive at the company where analyst Val Driveno works. Effie needs a dashboard built for her sales team that will help them track their interactions with customers during a campaign that starts…gasp…this week! And while this would be a great opportunity for the company to utilize Value Driven Analytics’ speedy dashboard consulting services for “fire drill” situations, unfortunately this one is on Val. She needs to build an automated dashboard that combines data sources as varied as a large Salesforce object, an Excel file on a shared network drive, and data that needs to be pulled from a Python API.
Sadly, none of these are currently available in the company’s SQL database! Thus, her first task is to create a process that will migrate these data sources to the SQL database and automatically keep them up to date on a daily basis. She could use traditional ETL tools and processes, but that typically takes weeks before the new tables are up and running. She could instead create three separate ad hoc Python jobs to import the data and write it to the database, then build three separate Task Scheduler schedules, one for each process; this approach could take hours to complete, leaving her little time for actually building the dashboard, and would add three more disconnected processes to the company’s scattered ETL jobs. It’s looking like the sales team will either have to delay their campaign or start it without access to data. Could there possibly be another solution that would allow Val to conduct these ETL operations efficiently and in an organized manner?
With “Efficient ETL”, the action of setting up the ETL process goes from taking hours or weeks to just minutes! Just as importantly, it keeps ETL processes organized in one place and allows for automated process monitoring.
Before we dive into how Efficient ETL works, let’s examine how most organizations handle ETL and analytics projects in general. Too often, analysts and data scientists think about individual data requests and analytics projects in silos. For Project A, an ETL process (extract, transform, load: the process of building a table and keeping it up to date in an automated manner) is built and automated, and an interactive dashboard is built and automated on top of it. When a very similar Project B is requested, a completely separate process starts all over again. Another ETL process, another automation schedule, and another interactive dashboard are created. At best, the analyst spots the opportunity to copy and paste some code from Project A and adapt it for Project B. At worst, they start over from scratch, especially if Project A was performed by a different analyst.
This is where the concept of a “reusable asset” comes in and can dramatically increase an analyst’s efficiency. It is important for each analyst, and even for teams as a whole, to periodically stop and reflect on where most of their time is being spent. We’re not talking about individual projects here, but more of a bucketing of time into categories like “extracting data from Excel into the SQL database”, “pulling ad hoc customer data”, and so on. Focusing especially on the buckets where the most time is spent, brainstorm how these tasks could be automated. Which parts of the process are performed in the same way for essentially every request in the bucket? These can perhaps be coded into a Python job; let’s call these the “automated actions” of the reusable asset. Which inputs tend to be different for each request in the bucket? These can perhaps be set up in an easy-to-update input source (like an Excel worksheet) that the Python job references; let’s call these the “customizable inputs” of the reusable asset. Even if the whole task can’t be completely automated, could some kind of reusable asset be created that would bring the time required down from hours to minutes (i.e., partial automation)? Most people don’t think of analytics as a “creative” profession, but this is a time when creative thinking is absolutely crucial!
“Efficient ETL” is, at its core, a reusable asset applied to one or more types of ETL. Perhaps in the brainstorming exercise above, the analyst or team identifies that they spend a lot of time creating individual processes that upload Excel files, Salesforce queries, SharePoint files, and the like to a SQL database. After some creative thinking, the team might decide to create a single folder containing a txt file for each Excel file that needs to be uploaded to a SQL database table. Each txt file would contain a reference to an Excel worksheet and the name of the SQL table it should be uploaded to. Likewise for Salesforce queries, SharePoint files, and other data sources. Even various Python jobs could be placed in a single folder. These are all examples of customizable inputs of the reusable asset. Next, a single scheduled Python process could be created that loops through each of the txt files in each folder, pulls the data based on the parameters in the file, and uploads it to the SQL database table specified in the file! This would be an example of an automated action of the reusable asset.
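To make the mechanics concrete, here is a minimal sketch of what that single scheduled process might look like for the Excel folder. The folder path, the key=value layout of the txt files, and the connection string are illustrative assumptions rather than a prescribed standard; a real implementation would follow whatever conventions your team already uses.

```python
# A minimal sketch of the "automated action": one scheduled job that loops
# through txt config files and uploads each referenced Excel sheet to SQL.
# The folder path, file layout, and connection string below are placeholders.
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine

CONFIG_FOLDER = Path(r"\\shared\etl_configs\excel_uploads")  # hypothetical path
engine = create_engine(
    "mssql+pyodbc://etl_user:***@sql-server/analytics"
    "?driver=ODBC+Driver+17+for+SQL+Server"  # placeholder connection string
)

def run_excel_uploads():
    for config_file in CONFIG_FOLDER.glob("*.txt"):
        # Each txt file holds simple key=value pairs, e.g.:
        #   source=\\shared\sales\pipeline.xlsx
        #   sheet=Q3
        #   target_table=sales_pipeline
        params = dict(
            line.strip().split("=", 1)
            for line in config_file.read_text().splitlines()
            if "=" in line
        )
        df = pd.read_excel(params["source"], sheet_name=params.get("sheet", 0))
        # Replace the table each run so it always mirrors the source file
        df.to_sql(params["target_table"], engine, if_exists="replace", index=False)

if __name__ == "__main__":
    run_excel_uploads()
```

With a loop like this in place, onboarding a new Excel source is only a matter of dropping another txt file into the folder; the scheduled job picks it up on its next run.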
What is the return for creating this single scheduled Python process? Now, whenever a new Excel file needs to be uploaded to and maintained in a SQL database table, it’s as simple as dropping a txt file in a folder! No need to schedule anything and no need to copy or write any code, leading to dramatic time savings. Not only that, but now all of the Excel files that get uploaded to SQL database tables can be found in a single folder (no more hunting for processes!), and modifying any given one takes seconds. Even more, that single scheduled Python process can be set up to automatically handle tracking and error notifications for all the jobs it runs. We recommend incorporating code that sends a single e-mail after the process has completed with a summary of all the jobs that ran that morning.
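As a rough sketch of that recommendation, the loop below records a one-line outcome per job and sends a single summary e-mail at the end. It assumes an internal SMTP relay; the server name and addresses are placeholders.

```python
# Sketch: run every job, collect outcomes, send one summary e-mail.
import smtplib
from email.message import EmailMessage

def run_all_jobs(jobs):
    """jobs is a dict of {job_name: callable}; run each and collect outcomes."""
    results = []
    for name, job in jobs.items():
        try:
            job()
            results.append(f"OK      {name}")
        except Exception as exc:  # keep going so one failure doesn't block the rest
            results.append(f"FAILED  {name}: {exc}")
    return results

def send_summary(results):
    msg = EmailMessage()
    msg["Subject"] = "Morning ETL summary"
    msg["From"] = "etl-bot@example.com"          # placeholder addresses
    msg["To"] = "analytics-team@example.com"
    msg.set_content("\n".join(results))
    with smtplib.SMTP("smtp.example.com") as server:  # placeholder SMTP relay
        server.send_message(msg)
```

One e-mail per morning, rather than one per job, keeps the signal high: the team scans a single message and only digs in when a line reads FAILED.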
While the basic setup described above is ideal for ETL jobs that simply need to be run sometime in the morning (not necessarily at a specific time) and that run fairly quickly (generally < 10 minutes), it can be modified to handle just about any requirements. Imagine scheduling this process to run every 5 minutes in Task Scheduler, checking which tasks need to be run at that specific time based on a reference file or characters in the file name, and executing them as separate, independent Python jobs that can run concurrently. Think of how this process could be set up not only for uploads of data to a SQL database table, but for other types of data processing as well; imagine what other checks and notifications could be built into the process if necessary. This is where even more creativity comes in: the sky is the limit!
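One hedged sketch of that variant: Task Scheduler invokes the script below every 5 minutes, a scheduled time embedded in each file name decides what is due now, and each due job is launched as its own Python process so long-running jobs overlap instead of queuing. The naming convention and the run_single_job.py helper are hypothetical, chosen only to illustrate the idea.

```python
# Sketch of the every-5-minutes variant: files named like refresh_sales_0715.txt
# are launched when the current HHMM matches the suffix in the file name.
import subprocess
import sys
from datetime import datetime
from pathlib import Path

JOB_FOLDER = Path(r"\\shared\etl_configs\scheduled_jobs")  # hypothetical path

def launch_due_jobs():
    now = datetime.now().strftime("%H%M")
    for config_file in JOB_FOLDER.glob(f"*_{now}.txt"):
        # Each due job runs as its own Python process, so slow jobs
        # overlap instead of blocking one another.
        subprocess.Popen([sys.executable, "run_single_job.py", str(config_file)])

if __name__ == "__main__":
    launch_due_jobs()
```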
With even a basic Efficient ETL system in place, Val is able to complete the ETL part of the project in less than an hour and spend the rest of the time building an actionable interactive dashboard for the sales team, all in time for the new campaign debut! And if similar requests come up for future sales campaigns, Val may even decide to create a single reusable-asset dashboard!
At Value Driven Analytics, we can help you set up an Efficient ETL process of your own, helping your BI, analytics, and data science teams become dramatically more efficient. We’ll customize the solution to the types of ETL processes that your teams most often face, incorporate best practices, and add any customized features and alert systems your organization desires. We apply more than a dozen years of analytics experience and innovation to efficiently create your solution. Rather than sell your organization a single analytics solution for a big ticket price, our goal is to use this project as an opportunity to illustrate the rigor, affordability, and speed of our analytics solutions in order to become your #1 choice for future analytics projects. We can also train your analytics team on coding and efficiency techniques, including how to create reusable assets.
Learn more about how an Efficient ETL process can dramatically increase your team’s productivity.


